Integrating Information to Bootstrap Information Extraction from Web Sites

نویسندگان

  • Fabio Ciravegna
  • Alexiei Dingli
  • David Guthrie
  • Yorick Wilks
چکیده

In this paper we propose a methodology to learn to extract domain-specific information from large repositories (e.g. the Web) with minimum user intervention. Learning is seeded by integrating information from structured sources (e.g. databases and digital libraries). Retrieved information is then used to bootstrap learning for simple Information Extraction (IE) methodologies, which in turn will produce more annotation to train more complex IE engines. All the corpora for training the IE engines are produced automatically by integrating information from different sources such as available corpora and services (e.g. databases or digital libraries, etc.). User intervention is limited to providing an initial URL and adding information missed by the different modules when the computation has finished. The information added or delete by the user can then be reused providing further training and therefore getting more information (recall) and/or more precision. We are currently applying this methodology to mining web sites of Computer Science departments.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

An introduction to methods of discovering and identifying ancient sites with emphasis on evidence and geomorphologic techniques

Recognizing of position of ancient sites, it is of the great help to archaeologist. After this recognition, the archaeologist with rely on the knowledge and usual techniques in archaeology can determine the range of sites. After the discovery of this information, the archaeologist can get the information about the social, economic, livelihood and political of the past of sites. In this researc...

متن کامل

Grammatical inference for information extraction and visualisation on the Web

The world-wide web contains a wealth of database-style information scattered across different sites that could be better used if it were integrated into a single view. Since document formats vary widely between sites and frequently mingle structural with presentation markup, extracting and integrating data from web pages is a difficult challenge. Manually writing extraction wrappers is expensiv...

متن کامل

Mining Web Sites Using Unsupervised Adaptive Information Extraction

Adaptive Information Extraction systems (IES) are currently used by some Semantic Web (SW) annotation tools as support to annotation (Handschuh et al., 2002; Vargas-Vera et al., 2002). They are generally based on fully supervised methodologies requiring fairly intense domain-specific annotation. Unfortunately, selecting representative examples may be difficult and annotations can be incorrect a...

متن کامل

Learning to Harvest Information for the Semantic Web

In this paper we describe a methodology for harvesting information from large distributed repositories (e.g. large Web sites) with minimum user intervention. The methodology is based on a combination of information extraction, information integration and machine learning techniques. Learning is seeded by extracting information from structured sources (e.g. databases and digital libraries) or a ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003